Assistant Commissioner for Patents

Applicant herewith submits to the United States Designated/Elected Office
(DO/EO/US) the following items and other information under 35 U.S.C. 371:



1. This express request to immediately begin national examination procedures (35
U.S.C. 371(f)).

2. The U.S. National Fee (35 U.S.C. 371(c)(1)) and other fees (37 CFR 1.492) 
as indicated below. 

3. A copy of the International application (35 U.S.C. 371(c)(2)):

a. [x] is transmitted herewith
(International Publication No. WO 99/39282).

b. [ ] is not required, as the application was filed with the United States
Receiving Office.

c. [ ] has been transmitted by the International Bureau. A copy of Form
PCT/IB/308 is enclosed.

4. [X] A translation of the International application into the English language (35
U.S.C. 371(c)(2)) is transmitted herewith.

5. Amendments to the claims of the International application under PCT Article 34: 

a. [X] are transmitted herewith. 

b. [ ] have been transmitted by the International Bureau. 

6. [X] A translation of the amendments to the claims under PCT Article 34 is
transmitted herewith.

7. A copy of the international examination report (PCT/IPEA/409)

b. [ ] is not required as the United States Patent and Trademark Office
was the IPEA.

8. Annex(es) to the international preliminary examination report

b. [ ] is not required as the United States Patent and Trademark Office
was the IPEA.

9. [x] A translation of the annexes to the international preliminary examination
report is transmitted herewith.

10. [ ] An oath or declaration of the inventor (35 U.S.C. 371(c)(4)) complying
with 35 U.S.C. 115 is submitted herewith.

11. An International Search Report (PCT/ISA/210)

a. [X] is transmitted herewith.

b. [ ] has been transmitted by the International Bureau.

c. [ ] is not required, as the application was searched by the United
States International Searching Authority.

12. [ ] An Information Disclosure Statement under 37 CFR 1.97 and 1.98 is
transmitted herewith, along with Form PTO-1449 and copies of citations
listed.

13. [ ] An assignment document is transmitted herewith for recording, along
with a separate cover sheet.

14. [ ] A preliminary amendment is enclosed.

15. [ ] A verified statement claiming small entity status is enclosed.

16. [ ] Other:

Basic National Fee                                    Fee
IPEA - US                                             $670.00
ISA - US                                              $760.00
PTO not ISA or IPEA                                   $970.00
Claims meet PCT Art. 33(1)-(4) - IPEA - US            $96.00
Filing with EPO or JPO search report                  $840.00    $840.00
Enter appropriate basic fee                                      $840.00

Claims*               Number filed    Number extra    Rate
Total claims          9 - 20          0               $18.00     $
Independent claims    1 - 3           0               $78.00     $
Multiple dependent claims (if applicable)             $260.00

Total of above                                                   $840.00
Small entity statement enclosed, 1 if Yes, 0 if No:   0          $0.00
Total national fee                                               $840.00
Fee for recording enclosed assignment                 $40.00
Total fees enclosed                                              $840.00

*After any attached preliminary amendment reducing the number of claims and/or
deleting multiple dependencies.

[x] A check in the amount of $840.00 to cover the above fees is
enclosed.

[ ] Please charge our Deposit Account No. 18-0988 in the amount of
$      . A duplicate copy of this sheet is enclosed.

WARNING: TO AVOID ABANDONMENT OF THE APPLICATION THE BASIC NATIONAL
FEE MUST BE PAID WITHIN THE 20/30 MONTH TIME LIMIT.

16. The Commissioner is hereby authorized to charge the following additional fees
that may be required by this paper and during the entire pendency of this
application to our Deposit Account No. 18-0988:

a. [X] 37 CFR 1.492(a)(1), (2), (3), (4) and (5) (filing fees)

WARNING: BECAUSE FAILURE TO PAY THE NATIONAL FEE WITHIN 30 MONTHS WITHOUT EXTENSION
(37 CFR § 1.495(b)(2)) RESULTS IN ABANDONMENT OF THE APPLICATION, IT WOULD BE BEST TO
ALWAYS CHECK THE ABOVE BOX.

b. [ ] 37 CFR 1.492(b), (c) and (d) (presentation of extra claims)

NOTE: Because additional fees for excess or multiple dependent claims not paid on filing or on later
presentation must only be paid or these claims cancelled by amendment prior to the expiration of the time
period set for response by the PTO in any notice of fee deficiency (37 CFR 1.492(d)), it might be best not
to authorize the PTO to charge additional claim fees, except possibly when dealing with amendments after
final action.

Respectfully submitted,




Direct all correspondence and telephone calls to: 

Neil A. DuChez, Esq. 

RENNER, OTTO, BOISSELLE & SKLAR, P.LL 
1621 Euclid Avenue, 19th Floor 
Cleveland, Ohio 44115 

Tel: 216-621-1113 Fax: 216-621-6165

DESCRIPTION

Scoring of Text Units

TECHNICAL FIELD

The invention relates to a method and a system for scoring
text units (e.g. sentences), for example according to
their contribution in defining the meaning of a source
text (textual relevance), their ability to form a
cohesive subtext (textual connectivity) or the extent and
effectiveness to which they address the different topics
which characterise the subject matter of the text (topic
aptness).

BACKGROUND ART

When abridging a text it is desirable to select a portion
of the text which is most representative in that it
contains as many of the key concepts defining the text
as possible (textual relevance). As an example,
EP-A-741364 (Xerox Corp.) discloses a method of selecting
key phrases from a machine readable document by (a)
generating from the document a multiplicity of candidate
phrases (units of more than one word), followed by (b)
selecting as key phrases a subset of the candidate phrases.
This selection, known as summarisation, may also take into
consideration the degree of textual connectivity among
sentences so as to minimise the danger of producing
summaries which contain poorly linked sentences.

Computing lexical cohesion for all pair-wise text unit
combinations in a text provides an effective way of
assessing textual relevance and connectivity in parallel,
see for example Hoey M. (1991) Patterns of Lexis in Text.
OUP, Oxford, UK; and Collier A. (1994) A System for
Automatic Concordance Line Selection. NEMLAP 1994,
Manchester, UK. A simple way of computing lexical
cohesion for a pair of text units is to count non-stop
words which occur in both text units. Non-stop words can
be intuitively thought of as words which have high
informational content. They usually exclude words with
a very high frequency of occurrence, e.g. closed class
words such as determiners, prepositions and conjunctions,
see for example Fox C. (1992) Lexical Analysis and
Stoplists, in Frakes W and Baeza-Yates R (eds)
Information Retrieval: Data Structures & Algorithms.
Prentice Hall, Upper Saddle River, NJ, USA, pp 102-130.

A sample list of stop words is given below.

a about above across after again against all almost alone
along already also although always among and another any
anybody anyone anything anywhere are area areas around
as ask asked asking asks at away b back backed backing
backs be became because become becomes been before begun
behind being beings best better between big both but by
c came can cannot case cases certain certainly clear
clearly come could d did differ different differently
do does done down downed downing ... v very w
want wanted wanting wants was way ways we well
wells went were what when where whether which while
who whole whose why will with within without work
worked working works would x y year years yet you
young younger youngest your yours z

Text units which contain a greater number of shared
non-stop words are more likely to provide a better
abridgement of the original text for two reasons:

the more often a word with high informational content
occurs in a text, the more topical and germane
to the text the word is likely to be, and

the more times two text units share a word,
the more connected they are likely to be.
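
As a minimal sketch of the simple measure described above, the following Python function collects the non-stop words that two text units have in common. The stop list is a small hypothetical subset chosen for the example, and the tokenizer (lower-casing and keeping runs of letters) is an assumption; neither is prescribed by the text.

```python
import re

# Hypothetical, heavily abbreviated stop list (for illustration only).
STOP_WORDS = {"a", "for", "is", "the", "to", "on", "of", "in"}

def non_stop_words(text_unit):
    """Return the set of lower-cased non-stop words in a text unit."""
    words = re.findall(r"[a-z]+", text_unit.lower())
    return {w for w in words if w not in STOP_WORDS}

def shared_non_stop_words(unit_a, unit_b):
    """Return the non-stop words that occur in both text units."""
    return non_stop_words(unit_a) & non_stop_words(unit_b)

unit_1 = "Report: Apple looking for a Partner"
unit_2 = "Apple is actively looking for a friendly merger partner"
shared = shared_non_stop_words(unit_1, unit_2)
print(sorted(shared))  # ['apple', 'looking', 'partner']
```

The size of the returned set (three here) is the kind of count on which the cohesion measure is based.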

As an illustrative example, consider the ranking of the
following sample text, where digits surrounded by hash
characters (#) are text unit indexes.

#1# Report: Apple looking for a Partner

#2# NEW YORK (Reuter) - Apple is actively looking for
a friendly merger partner, according to several
executives close to the company, the New York
Times said on Thursday.

#3# One executive who does business with Apple said
Apple employees told him the company was again
in talks with Sun Microsystems, the paper said.

#4# On Wednesday, Saudi Arabia's Prince Alwaleed Bin
Talal Bin Abdulaziz Al Saud said he owned more
than five percent of the computer maker's stock,
recently buying shares on the open market for a
total of $115 million.

#5# Oracle Corp Chairman Larry Ellison confirmed on
March 27 he had formed an independent investor
group to gauge interest in taking over Apple.

#6# The company was not immediately available to
comment.

To compute lexical cohesion according to the method
suggested by Hoey (see above reference), all unique
pairwise combinations of text units are scored according
to how many words they share, as shown in the table
below.



Text unit pairs    Words shared                        Score

#1#  #2#           Apple, look, partner                3
#1#  #3#           Apple, Apple                        2
#1#  #4#                                               0
#1#  #5#           Apple                               1
#1#  #6#                                               0
#2#  #3#           Apple, Apple, executive, company    4
#2#  #4#                                               0
#2#  #5#           Apple                               1
#2#  #6#           company                             1
#3#  #4#                                               0
#3#  #5#           Apple, Apple                        2
#3#  #6#           company                             1
#4#  #5#                                               0
#4#  #6#                                               0
#5#  #6#                                               0


The number of shared words (including multiple occurrences
of the same word) in each text unit pair provides the
individual score for that pair. For example, the individual
scores for all pairs involving text unit #2# are:

       #1#   #2#   #3#   #4#   #5#   #6#
#2#    3           4     0     1     1

Table 1
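
The exhaustive pairwise scoring can be sketched as follows. The word lists are abbreviated and given in the reduced form used in the table (e.g. look, execut). Counting a repeated word as often as it occurs in whichever unit repeats it most is an assumption; the text does not spell out the rule, but this reading reproduces the scores listed above.

```python
from collections import Counter
from itertools import combinations

# Reduced words per text unit (abbreviated: words that occur in only one
# unit cannot create a link, so most of them are omitted here).
UNITS = {
    1: ["report", "Apple", "look", "partner"],
    2: ["Apple", "activ", "look", "partner", "execut", "company"],
    3: ["execut", "Apple", "Apple", "company", "paper"],
    4: ["Wednesday", "percent", "stock"],
    5: ["Oracle-Corp", "investor", "Apple"],
    6: ["company", "avail", "comment"],
}

def pair_score(words_a, words_b):
    """Count shared words; a repeated word counts as often as it occurs in
    whichever unit repeats it most (an assumed reading of the table)."""
    count_a, count_b = Counter(words_a), Counter(words_b)
    shared = count_a.keys() & count_b.keys()
    return sum(max(count_a[w], count_b[w]) for w in shared)

def all_pair_scores(units):
    """Score every unique pair of text units, as in the exhaustive method."""
    return {(a, b): pair_score(units[a], units[b])
            for a, b in combinations(sorted(units), 2)}

scores = all_pair_scores(UNITS)
print(scores[(1, 2)], scores[(2, 3)], scores[(3, 5)])  # 3 4 2
```

Six text units already require fifteen pair comparisons, and the number of pairs grows quadratically with the number of text units.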
The final score for a given text unit is obtained by summing
the individual scores for that text unit. According to
Hoey (see above reference), the number of links (e.g.
shared words) across two text units must be above a certain
threshold for the two text units to achieve a lexical
cohesion rank. For example, if only individual scores
greater than 2 are taken into account, the final score
for text unit #2# is (3+4=) 7. Proceeding in the same way,
the final scores for text units #1# and #3# are 3 and 4
respectively.

Such a scoring provides the following ranking:

first: text unit #2# (final score: 7);
second: text unit #3# (final score: 4); and
third: text unit #1# (final score: 3).
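
The thresholded summation and the resulting ranking can be reproduced with a short sketch. The dictionary layout and function names are illustrative assumptions; the scores are the non-zero individual scores of the sample text.

```python
# Non-zero individual scores for the text unit pairs of the sample text.
PAIR_SCORES = {
    (1, 2): 3, (1, 3): 2, (1, 5): 1,
    (2, 3): 4, (2, 5): 1, (2, 6): 1,
    (3, 5): 2, (3, 6): 1,
}

def final_scores(pair_scores, threshold):
    """Sum, per text unit, the individual scores greater than the threshold."""
    totals = {}
    for (a, b), score in pair_scores.items():
        if score > threshold:
            totals[a] = totals.get(a, 0) + score
            totals[b] = totals.get(b, 0) + score
    return totals

def rank(totals):
    """Order text units by descending final score."""
    return sorted(totals, key=lambda unit: totals[unit], reverse=True)

totals = final_scores(PAIR_SCORES, threshold=2)
print(totals)        # {1: 3, 2: 7, 3: 4}
print(rank(totals))  # [2, 3, 1]
```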

A text abridgement can be obtained by selecting text units
in ranking order according to the text percentage
specified by the user. For example, a 35% abridgement of
the text (i.e. an abridgement of up to 35% of the total
number of text units in the sample text) would result in
the selection of text units #2# and #3#.
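
A sketch of this selection step follows. Rounding the unit budget down and returning the selected units in their original text order are assumptions made for the example.

```python
def abridge(ranked_units, total_units, percentage):
    """Select top-ranked units, up to `percentage` of the total unit count,
    and return them in their original text order."""
    limit = total_units * percentage // 100  # round down: "up to" the percentage
    return sorted(ranked_units[:limit])

print(abridge([2, 3, 1], total_units=6, percentage=35))  # [2, 3]
```

With six text units, 35% allows at most two units, so the two highest-ranked units are kept.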
Further details about lexical cohesion and the ways in
which it can be used to aid summarisation can be found
in the Hoey and Collier references mentioned above.

Other prior art on related technology includes: Doi (1991)
Method and apparatus for producing an abstract of a
document - US patent 5077668; Ukita et al. (1993) Digital
Computing Apparatus for Preparing Document Text - US
patent 5257186; Withgott et al. Method and apparatus for
Summarising documents according to theme - US patent
5364703; and Pedersen, J. & J. Tukey (1997) Method and
Apparatus for Automatic Document Summarisation - US
patent 5638543.

DISCLOSURE OF INVENTION

It is an object of the invention to provide a method and
system for ranking text units which overcomes at least
some of the disadvantages of the prior art.

According to the invention there is provided a method of
operating on a text including a plurality of text units,
each including one or more strings, the method including
the steps of:

forming a structure for each of at least some of said
strings, in which structure the string is associated with
each text unit in which the string occurs;

for each text unit summing the number of occurrences of
each other text unit in the same structure or structures
so as to form an individual score for each pair of text
units; and

processing said individual scores for each text unit in
order to form a final score for each text unit.

The use of such structures considerably reduces the time
taken to operate on the text because it is no longer
necessary to count the number of strings shared between
all possible pairs of text units in turn.

More specifically, the degree of connectivity of a text
unit with all other text units in a text can be simply
assessed by quantifying the elements (e.g. words) which
each text unit shares with pairs built by associating each
element in the text with the list of pointers to the text
units in which the element occurs. This provides a
significant advantage in terms of processing speed when
compared to a method such as the one described by Hoey
(1991) and Collier (1994) where the same assessment is
carried out by computing all pairwise combinations of text
units. In particular, the word-per-second processing
rate is significantly less affected by text size.

The method may include the further step of ranking the
text units on the basis of said individual scores.

In one embodiment of the invention, said text units are
sentences, said strings are words forming said sentences,
and the method includes the additional steps of removing
stop-words, stemming each remaining word and indexing the
sentences prior to carrying out said summing step, and
said structures are stem-index records each including a
stemmed word and one or more indexes corresponding to
sentences in which said stemmed word occurs.

In an alternative embodiment, said text is associated with
a word text including words, each word being associated
with one or more subject codes representing subjects with
which said word is associated, and said strings are subject
codes associated with said words.

In this case the method may comprise the further step of
keeping a record of the word spelling associated with each
occurrence of a subject code in a text unit, and during
said summing step disregarding occurrences of the same
subject code in a pair of text units if the same word
spelling is associated with said same subject code in said
pair of text units.

It will be appreciated that each word may have a number
of possible subject codes, some of which are contextually
inappropriate for the context in which the word is being
used. The last-mentioned feature allows the method to
perform disambiguation of the subject codes, by
disregarding occurrences of subject codes which are
contextually inappropriate, as will be described in
greater detail below.

Said step of disregarding occurrences of subject codes
may not be carried out for subject codes which relate to
only a single word spelling in the word text.

Said processing step may include calculating a level for
each text unit, in addition to said final score, and said
level may indicate the value of the highest of said
individual scores in relation to a threshold value.

This allows text units to be ranked first according to
level, and second according to said final score, if
desired.

The invention also provides a storage medium containing
a program for controlling a programmable data processor
to perform the method described above.

The invention also provides a system for ranking text
units in a text, the system including a data processor
programmed to perform the steps of the method described
above.

BRIEF DESCRIPTION OF DRAWINGS

Preferred embodiments of the invention will now be
described, by way of example only, with reference to the
accompanying drawings, in which:

Figure 1 shows a flow chart outlining some of the
steps involved in a preferred embodiment of the
invention;

Figure 2 shows a flow chart which is a continuation of
the flow chart of Figure 1; and

Figure 3 shows an apparatus suitable for carrying out the
methods described below.

BEST MODE FOR CARRYING OUT THE INVENTION

In an embodiment of the invention described below the
ranking of text units is carried out with reference to
the presence of shared words across text units. The
assessment of textual relevance and connectivity can both
be carried out by counting shared links (e.g. identical
words) across all text unit pairs. The method makes it
possible to perform this assessment by quantifying the
elements (e.g. words) which each text unit shares with
stem-index pairs, each such pair comprising an element
in the text and a list of pointers to the text units in
which the element occurs. This technique makes it
possible to rank text units at a processing rate which
is significantly less affected by text size than a system
where the same assessment is carried out by computing all
pairwise combinations of text units.

The ranking is done by assessing

how germane each text unit is to the source text (textual
relevance);

how well connected each text unit is to other text units
in the source text (textual connectivity); and

how well each text unit represents the various topics
dealt with in the source text (topic aptness).

In a further embodiment described below, the same
technique is used for assessment of topic aptness. Shared
links across text units are verified in terms of
overlapping semantic codes associated with words (e.g.
the connotations business and government for the
word executive) with reference to a dictionary or
thesaurus database providing a specification of such
codes for word entries.

The method can be divided into two phases, namely a
preparatory phase, followed by a ranking phase. In the
preparatory phase the text undergoes a number of
normalisations which have the purpose of facilitating the
process of computing lexical cohesion. This phase
includes the following operations:

text segmentation;
removal of formatting commands;
recognition of proper names;
recognition of multi-word expressions;
removal of stop words; and
word tokenization.

Further ways of normalizing the input text are also
mentioned later in the specification.

The object of segmentation is to partition the input text
into text units which stand on their own (e.g. sentences,
titles, and section headings) and to index such text units,
for example as shown in the sample text given above.

Next, formatting commands such as the HTML (hyper-text
mark-up language) mark-ups in the text are dealt with.

The sample text including HTML formatting commands looks
like the following:

<h2>Report: Apple Looking for a Partner</h2>
<!--TextStart-->
<P>
NEW YORK (Reuter) - Apple is actively looking for a
friendly merger partner, according to several executives
close to the company, the New York Times said on Thursday.
<P>
One executive who does business with Apple said Apple
employees told him the company was again in talks with
Sun Microsystems, the paper said.
<P>
On Wednesday, Saudi Arabia's Prince Alwaleed Bin
Talal Bin Abdulaziz Al Saud said he owned more than five
percent of the computer maker's stock, recently buying
shares on the open market for a total of $115 million.
<P>
Oracle Corp Chairman Larry Ellison confirmed on March
27 he had formed an independent investor group to gauge
interest in taking over Apple.
<P>
The company was not immediately available to comment.
<!--TextEnd-->

In the present embodiment, the formatting commands are
simply removed, but alternative treatments are mentioned
below.
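
Simple removal of formatting commands can be sketched with a regular expression. This is a deliberately crude illustration, not the parser used in the embodiment: it assumes that every command is delimited by '<' and '>' and would be fooled by a '>' inside a comment or attribute value.

```python
import re

def strip_formatting(html):
    """Drop anything between '<' and '>' and return the remaining text lines."""
    text = re.sub(r"<[^>]*>", "\n", html)
    return [line.strip() for line in text.splitlines() if line.strip()]

sample = (
    "<h2>Report: Apple Looking for a Partner</h2>\n"
    "<!--TextStart-->\n"
    "<P>\n"
    "The company was not immediately available to comment.\n"
    "<!--TextEnd-->"
)
for line in strip_formatting(sample):
    print(line)
```

Replacing each command with a line break rather than an empty string keeps the heading and the paragraphs apart, which is convenient if segmentation is applied afterwards.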
A facility for recognizing proper names and multi-word
expressions is also included. Such a facility makes it
possible to process expressions such as Apple, New York,
New York Times, gauge interest as single units which should
not be further tokenized. The recognition of such units
ensures that expressions which superficially resemble
each other, but have different meanings - e.g. Apple (the
company) and apple (the fruit), or York in New York (the
city) and New York Times (the newspaper) - do not actually
generate lexical cohesion links. For further information
relating to recognising proper nouns and multi-word
expressions reference can be made respectively to David
McDonald (1996) Internal and External Evidence in the
Identification and Semantic Categorization of Proper
Names, in B. Boguraev and J. Pustejovsky (eds) Corpus
Processing for Lexical Acquisition, MIT Press; and Justeson,
J. and Katz, S.M. (1995) Technical terminology: some
linguistic properties and an algorithm for identification
in text, in Natural Language Engineering, 1:9-27.

Next, all words in the input text which match stop words,
such as those mentioned above, are removed. This step
ensures that words which are low in informational content
are not taken into account when assessing lexical
cohesion. After stop-word removal, the calculation of
shared words across text units is further optimized by
tokenizing non-stop words. Word tokenization is achieved
by reducing words into stems or citation forms, e.g.

Input strings       stems          citation forms
actively looking    activ look     active look



Citation forms generally correspond to the manner in which
words are listed in conventional dictionaries, and the
process of reducing words to citation form is referred
to as lemmatisation. Reduction of words to stem form
generally involves a greater truncation of the word in
which all inflections are removed. The purpose of
reducing words to stems or citation forms is to achieve
a more effective notion of word sharing, e.g. one which
abstracts away from the effects of inflectional and/or
derivational morphology. Stemming provides a very
powerful word tokenization technique as it undoes both
derivational and inflectional morphology. For example,
stemming makes it possible to capture the similarity
between the words nature, natural, naturally, naturalize,
naturalizing as they all reduce to the stem natur. Word
reduction to citation form would only capture the
relationship between naturalize and naturalizing. In the
present embodiment, stemming will be used. For a
description of some stemming techniques reference can be
made to Frakes W. (1992) Stemming Algorithms, in Frakes
W and Baeza-Yates R (eds) Information Retrieval: Data
Structures & Algorithms. Prentice Hall, Upper Saddle
River, NJ, USA, pp. 131-160. For further information
relating to lemmatisation reference can be made to Hadumod
Bussmann (1996) Routledge Dictionary of Language and
Linguistics, Routledge, London, p. 272.

Following the stages of stop-word removal and stemming,
the sample text is as shown below.



#1# report Apple look partner
#2# New-York Reuter Apple activ look friend merger
partner accord
execut close company New-York-Times Thursday
#3# execut busy Apple Apple employ tell company talk
Sun-Microsystems
paper say
#4# Wednesday Saudi-Arabia Prince
Alwaleed-Bin-Talal-Bin-Abdulaziz-Al-Saud
own percent computer maker stock
recent buy share market total 115 million
#5# Oracle-Corp Chairman Larry-Ellison confirm
March 27 form independent
investor gauge-interest take-over Apple
#6# company immediat avail comment
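
The effect of stemming on the nature/natural/naturalize family mentioned earlier can be illustrated with a toy suffix stripper. This is not one of the published stemming algorithms and is not the stemmer of the embodiment; the suffix list and minimum stem length are assumptions chosen only so that the example words collapse to natur.

```python
# Toy suffix list, checked in this order (illustrative assumption).
SUFFIXES = ("izing", "ize", "ally", "al", "e")

def toy_stem(word, min_len=4):
    """Repeatedly strip the first matching suffix while the remaining stem
    keeps at least `min_len` characters."""
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_len:
                stem = stem[: -len(suffix)]
                changed = True
                break
    return stem

words = ["nature", "natural", "naturally", "naturalize", "naturalizing"]
print({toy_stem(w) for w in words})  # {'natur'}
```

Because all five words map to the same stem, any two text units containing different members of the family would register a shared link.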

Following the preparatory phase described above, the
textual relevance and connectivity of each text unit is
assessed by measuring the number of stems which the text
unit shares with each of the other text units in the sample
text. The ranking process comprises two main stages: the
indexing of tokenized words, and the scoring of tokenized
words in text units.

In the first stage, all stems in the normalised text,
which has undergone the preparatory phase described
above, are indexed with reference to the text units in
which they occur. For example, Apple occurs five times
in four of the text units in the normalised text: once
in #1#, #2#, #5# and twice in #3#. Consequently, a record
is made where Apple is associated with these text unit
indexes:

<Apple {#1#, #2#, #3#, #3#, #5#}>

A similar record is made for each other stem in the
normalised text, each record being referred to as a
stem-index record.
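
The indexing stage can be sketched as a single pass over the normalised text. The dictionary-of-lists representation and the function name are illustrative assumptions, and the unit contents are abbreviated to the stems that matter for the example.

```python
from collections import defaultdict

# Normalised sample text, abbreviated to the stems relevant to the example.
NORMALISED = {
    1: ["report", "Apple", "look", "partner"],
    2: ["Apple", "activ", "look", "partner", "execut", "company"],
    3: ["execut", "Apple", "Apple", "company", "paper"],
    5: ["Oracle-Corp", "investor", "Apple"],
    6: ["company", "avail", "comment"],
}

def build_stem_index(units):
    """Map each stem to the list of unit indexes in which it occurs,
    keeping one entry per occurrence."""
    index = defaultdict(list)
    for unit_id in sorted(units):
        for stem in units[unit_id]:
            index[stem].append(unit_id)
    return dict(index)

stem_index = build_stem_index(NORMALISED)
print(stem_index["Apple"])    # [1, 2, 3, 3, 5]
print(stem_index["company"])  # [2, 3, 6]
```

Each stem is visited once per occurrence, so the cost of building the records grows with the length of the text rather than with the number of text unit pairs.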

A final text unit score is calculated for each text unit
using the list of stem-index records resulting from the
indexing stage described above. The objective of such a
scoring process is to register how often the tokenized
words from a text unit occur in each of the other text
units. In performing this assessment, provisions are
made for a threshold which specifies the minimal number
of links required for text units to be considered as
lexically cohesive. The recursive scoring procedure
used to generate the final scores for each text unit makes
use of the following variables.

TRSH    is the lexical cohesion threshold
TU      is the current text unit
LCTU    is the current lexical cohesion score of TU (i.e.
        LCTU is the count of tokenized words TU shares
        with some other text unit)
CLevel  is the level of the current lexical cohesion
        score calculated as the difference between LCTU
        and TRSH
Score   is the lexical cohesion score previously
        assigned to TU (if any)
Level   is the level for the lexical cohesion score
        previously assigned to TU (if any)

The scoring procedure makes use of a scoring structure
<Level, TU, Score>, and is repeated for each text unit
in turn, in order to produce the final score for the text
unit TU (i.e. the final value of LCTU in the scoring
structure). The procedure can then be repeated for other
text units TU. The recursive scoring procedure used in
this exemplary embodiment is as follows.

if LCTU = 0, then do nothing
else, if the scoring structure <Level, TU, Score> exists, then
    if Level > CLevel, then do nothing
    else, if Level = CLevel, then the new scoring
        structure is <Level, TU, Score + LCTU>
    else, if CLevel > 0, then
        if Level > 0, then the new scoring structure is
            <1, TU, Score + LCTU>
        if Level ≤ 0, then the new scoring structure
            is <1, TU, LCTU>
    else, if CLevel ≤ 0, the new scoring structure is
        <CLevel, TU, LCTU>
else (if the scoring structure does not exist, then)
    if CLevel > 0, then create the scoring structure
        <1, TU, LCTU>
    else create the scoring structure <CLevel, TU, LCTU>
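
One way to read the procedure is as an update applied once per individual score. The sketch below assumes that the scoring structure <Level, TU, Score> can be held as a dictionary entry mapping a text unit to a (level, score) pair; the function and argument names are hypothetical.

```python
def update_score(structures, tu, lc_tu, trsh):
    """Apply one step of the scoring procedure for text unit `tu`.

    `structures` maps a text unit to its (level, score) pair, standing in
    for the scoring structure <Level, TU, Score>.
    """
    if lc_tu == 0:
        return
    c_level = lc_tu - trsh
    if tu in structures:
        level, score = structures[tu]
        if level > c_level:
            return
        if level == c_level:
            structures[tu] = (level, score + lc_tu)
        elif c_level > 0:
            structures[tu] = (1, score + lc_tu) if level > 0 else (1, lc_tu)
        else:
            structures[tu] = (c_level, lc_tu)
    else:
        structures[tu] = (1, lc_tu) if c_level > 0 else (c_level, lc_tu)

structures = {}
for lc in [3, 4, 0, 1, 1]:   # individual scores of #2# against #1#, #3#..#6#
    update_score(structures, 2, lc, trsh=2)
for lc in [0, 1, 1, 0, 0]:   # individual scores of #6# against #1#..#5#
    update_score(structures, 6, lc, trsh=2)
print(structures)  # {2: (1, 7), 6: (-1, 2)}
```

With a threshold of 2, text unit #2# ends at level 1 with a final score of 7, and text unit #6# ends at level -1 with a final score of 2.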

The above procedure can be more readily understood by
referring to Figure 1, which shows the procedure in the
form of a flow chart. In the flow chart decisions are
indicated by diamond-shaped boxes. If the answer to the
question within the box is yes, the procedure follows
the arrow labelled Y at the bottom of the box, otherwise
the procedure follows the arrow labelled N at one of the
sides of the box.

The start of the procedure is indicated by step 10. In
step 12 the index of the first text unit of the
normalised text is taken and represented by #TU#. In
step 14 the index of the last text unit is taken and
represented by #B#. In the sample text given above, the
last text unit is text unit #6#. The procedure then
flows to step 16 where the lexical cohesion score of #TU#
and #B# is calculated and assigned to LCTU. This lexical
cohesion score is the individual score referred to above
and shown in Table 1. However, the manner in which it is
calculated differs from that described above, and will
now be described.

Suppose for example, we are scoring text unit #2# (i.e.
#TU# = #2#) with a lexical cohesion threshold of 2. First,
all stem-index records whose stem is present in text unit
#2# are selected, as shown below.

<Apple {#1#, #2#, #3#, #3#, #5#}>
<company {#2#, #3#, #6#}>
<execut {#2#, #3#}>
<look {#1#, #2#}>
<partner {#1#, #2#}>

Stems which are associated with only one text unit index
are eliminated from this list as they simply occur in a
text unit, but do not connect a pair of text units.

Then a tuplet is formed consisting of the index for the
text unit to be scored for lexical cohesion (i.e. #2#),
and all the stem-index records whose stem occurs in that
text unit, as shown below.

<#2# <Apple {#1#, #2#, #3#, #3#, #5#}>
     <company {#2#, #3#, #6#}>
     <execut {#2#, #3#}>
     <look {#1#, #2#}>
     <partner {#1#, #2#}> >

Next, identical index occurrences in the tuplet are summed
together, to give the following results.

       #1#   #2#   #3#   #4#   #5#   #6#
#2#    3           4     0     1     1

Table 2

Index occurrences referring to the text unit being
assessed (i.e. #2#) are not counted as they do not register
lexical cohesion (thus the second entry in the table is
blank).
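
Summing identical index occurrences in a tuplet amounts to counting, over the selected stem-index records, how often each other text unit index appears. The sketch below uses a counter for this; the data layout and function name are assumptions made for illustration.

```python
from collections import Counter

# Stem-index records whose stem occurs in text unit #2#.
RECORDS = {
    "Apple": [1, 2, 3, 3, 5],
    "company": [2, 3, 6],
    "execut": [2, 3],
    "look": [1, 2],
    "partner": [1, 2],
}

def individual_scores(unit_id, records, all_units):
    """Sum identical index occurrences across the records of a tuplet,
    ignoring occurrences of the unit being scored."""
    counts = Counter()
    for indexes in records.values():
        if unit_id in indexes:  # only records whose stem occurs in the unit
            counts.update(i for i in indexes if i != unit_id)
    return {other: counts[other] for other in all_units if other != unit_id}

print(individual_scores(2, RECORDS, range(1, 7)))
# {1: 3, 3: 4, 4: 0, 5: 1, 6: 1}
```

The same call with text unit 6 only picks up the company record and yields a score of 1 against text units 2 and 3 and 0 elsewhere.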

The same procedure of forming a tuplet and summing
identical index occurrences is then carried out for each
other text unit. For example, the tuplet for text unit
#6# is:

<#6# <company {#2#, #3#, #6#}>>

This is simpler than the tuplet for text unit #2# because
company is the only stem which text unit #6# shares with
any other text unit. This tuplet gives the following
results.

       #1#   #2#   #3#   #4#   #5#   #6#
#6#    0     1     1     0     0

Table 3

This method is considerably faster than that of the prior
art because it does not involve a comparison of every pair
of text units for each word in the sample text.

The final cohesion score of text units #2# and #6# is
calculated by applying the scoring procedure of Figure
1 to each row in Table 2 and Table 3 respectively. Scoring
a text unit according to this procedure involves adding
the individual scores which are either above a threshold
(for Level 1), or below the threshold and of the same
magnitude (for lower levels). (The use of Levels in the
procedure is discussed below.)

Having discussed the way in which individual lexical
cohesion scores (for each text unit pair) are calculated
in step 16 using tuplets, we shall return to Figure 1 to
follow the procedure for calculation of the final lexical
cohesion score for each text unit. However, before
returning to Figure 1 it is noted that the simplest way
of forming the final score would be to sum the individual
scores for each text unit (i.e. for #2# and #6#, sum each
row in Tables 2 and 3 above), whilst ignoring all
individual scores below a certain threshold value.
However, the procedure of Figure 1 goes further in that
it determines not only a final score for each text unit,
but also a level for each text unit, as discussed below.

The highest level is 1, which indicates that the greatest
individual score (for a given text unit) is above the
threshold. The final score for that text unit is then
simply the sum of all individual scores (for that text
unit) which are above the threshold.

The meanings of level 1 and the next three levels below
level 1, and the ways in which the final score for these
levels is calculated, are shown in the table below.

Level   Meaning of Level                             Final Score

 1      Greatest individual score > threshold        Sum of all individual scores above threshold
 0      Greatest individual score = threshold        Sum of all individual scores equal to threshold
-1      Greatest individual score = threshold - 1    Sum of all individual scores equal to threshold - 1
-2      Greatest individual score = threshold - 2    Sum of all individual scores equal to threshold - 2


It will be seen that if threshold = 0, only level 1
exists, and the final score for a given text unit is
simply the sum of all individual scores for that text
unit. In fact the total number of levels is equal to the
threshold + 1.



Some examples of individual scores, and the levels and 
final scores they produce (by following the procedure of 
Figure 1) for a threshold of 2 are given below.

Individual scores   Level   Final Score

2 0 2 0 1             0       4
1 1 0 0 0            -1       2
5 6 2 0 0             1      11
1 1 1 1 1            -1       5
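
The level and final-score rules can be checked against these examples with a short sketch. The function name level_and_score is hypothetical, and returning None as the level of a text unit whose individual scores are all zero is an assumption made for illustration.

```python
def level_and_score(scores, threshold):
    """Return (level, final score) for one text unit's individual scores."""
    top = max(scores, default=0)
    if top == 0:
        # Assumption: no shared strings at all, so no level is assigned.
        return None, 0
    if top > threshold:
        # Level 1: add up every individual score above the threshold.
        return 1, sum(s for s in scores if s > threshold)
    # Lower levels: add up the scores equal to the greatest individual score.
    return top - threshold, sum(s for s in scores if s == top)


for row in ([2, 0, 2, 0, 1], [1, 1, 0, 0, 0], [5, 6, 2, 0, 0], [1, 1, 1, 1, 1]):
    print(row, level_and_score(row, threshold=2))
```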



The purpose of calculating a level for each text unit is
to allow the text units to be ranked first according to
level (highest level first) and second according to final
score (highest final score first). In this way, text
units having no individual scores above the threshold are
not necessarily ignored in the subsequent summarisation
process.

Returning to Figure 1, in step 18 the procedure branches
into two depending on whether LC^TU = 0, where LC^TU is the
lexical cohesion score of the text unit currently being
considered. A lexical cohesion score of zero between two
text units (i.e. LC^TU = 0) indicates that the two text units
do not share any stems. If LC^TU = 0 then the procedure
goes to step 20. As discussed below, the text unit index
#B# is decremented by 1 at step 28 during each cycle of
the procedure. At step 20, if #B# has reached 1 then #TU#
is incremented by 1 in step 22. That is, the next text
unit (in this case #2#) is assigned to #TU#. In step 24
the procedure is stopped (at step 26) if #TU# has reached
the maximum value + 1 (i.e. 6+1 = 7 for our sample text),
otherwise control passes back to step 14.

At step 20, if #B# has not yet been decreased to the
first text unit (i.e. #1#) then control passes to step
28, in which #B# is decremented by 1 (i.e. the next lower
text unit is assigned to #B#).

It will therefore be seen that the effect of steps 10 to
28 is to calculate the individual lexical cohesion scores
for all pairs of text units.

Returning to step 18, if LC^TU does not equal 0, then
control passes to step 30, which determines whether or
not the scoring structure <Level, TU, Score> already
exists. The first time that step 30 is reached no
scoring structure will already exist, and control will
pass to step 32, which determines whether CLevel is
greater than 0. CLevel is the current value of Level and
is equal to (LC^TU - TRSH), where TRSH is the lexical
cohesion threshold, which is selected in advance. In
steps 34 and 36 values are assigned to the scoring
structure according to the outcome of step 32, and
control then passes back to step 20.

At step 30, if the scoring structure already exists
(which will always be the case except for the first time
step 30 is reached for each value of TU, given that the
first time step 30 is reached values are assigned to the
scoring structure at steps 34 and 36 as described above),
control passes to step 38 which determines whether Level
(i.e. the previous value of CLevel) is greater than
CLevel. If so, control passes back to step 20. Otherwise,
control passes to step 40, which determines whether
Level is equal to CLevel. If so, new values are assigned
to the scoring structure in step 42, and control passes
back to step 20. Otherwise, control passes to step 44
(see Figure 2), which determines whether CLevel is greater
than 0. If so, control passes to step 46, and new values
are assigned to the scoring structure in step 48, or step
50, depending on whether the level is greater than 0, and
control passes back to step 20. At step 44, if CLevel is
not greater than 0, control passes to step 52, which
determines whether CLevel is less than, or equal to, 0.
If step 52 is reached, the answer to this question should
always be yes, so that new values are assigned to the
scoring structure in step 54, and control is passed back
to step
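
One possible reading of this update loop is sketched below. It is an interpretation rather than a transcription of Figures 1 and 2: the exact values assigned in steps 34 to 54 are not spelled out in the text, so the sketch assumes that any CLevel above 0 is treated as level 1, that a score at the stored level is added to the stored score, and that a score at a higher level replaces the structure. The names update and score_unit are hypothetical.

```python
def update(structure, tu, lc, threshold):
    """Fold one pairwise score lc into the <Level, TU, Score> structure."""
    if lc == 0:
        return structure                 # step 18: no shared stems
    clevel = lc - threshold              # CLevel = LC - TRSH
    if clevel > 0:
        clevel = 1                       # assumption: all such scores are level 1
    if structure is None:
        return (clevel, tu, lc)          # first non-zero score for this unit
    level, _, score = structure
    if level > clevel:
        return structure                 # lower-level score: ignored
    if level == clevel:
        return (level, tu, score + lc)   # same level: accumulate
    return (clevel, tu, lc)              # higher level: start again from lc


def score_unit(tu, pair_scores, threshold):
    """Run the update over every other text unit's individual score."""
    structure = None
    for lc in pair_scores:
        structure = update(structure, tu, lc, threshold)
    return structure


print(score_unit(2, [3, 4, 0, 1, 1], threshold=2))
print(score_unit(6, [0, 1, 1, 0, 0], threshold=2))
```

Under these assumptions the loop never needs to hold the whole row of individual scores, only the current structure.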

Following the procedure of Figure 1 for all text units
in the sample text, and a threshold of 2, the levels and
final scores assigned to each text unit are as follows:

Text Unit   Level   Score

#1#           1       3
#2#           1       7
#3#           1       4
#4#                   0
#5#           0       2
#6#          -1       2



These provide the following ranking of text units in terms
of lexical cohesion.

Rank   Text Unit   Level   Score

1st    #2#           1       7
2nd    #3#           1       4
3rd    #1#           1       3
4th    #5#           0       2
5th    #6#          -1       2
6th    #4#                   0



This shows the preferred order in which the text units
will be selected in a summarisation process. It is noted
that no level is assigned to text unit #4#, as this text
unit shares no stems with any other text unit.
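
The two-key ranking (level first, final score second) can be sketched as a single sort. The dictionary below holds the levels and scores for the sample text at a threshold of 2; representing a missing level as None and placing such units last are assumptions made for illustration, and rank_units is a hypothetical name.

```python
# Text unit -> (level, final score); None marks a unit with no level.
results = {1: (1, 3), 2: (1, 7), 3: (1, 4), 4: (None, 0), 5: (0, 2), 6: (-1, 2)}


def rank_units(results):
    """Order text units by level, then by final score, highest first."""
    def key(item):
        _, (level, score) = item
        has_level = level is not None
        # Units without a level sort after every unit that has one.
        return (has_level, level if has_level else 0, score)

    ordered = sorted(results.items(), key=key, reverse=True)
    return [unit for unit, _ in ordered]


print(rank_units(results))
```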



When used with a dictionary database providing
information about the subject domain of words the method
described above can be slightly modified to detect the
major themes and topics of a document automatically.
As an example, the words in our sample text have the
following subject domain codes:

Word

actively-adv
business-n
buy-v
confirm-v
company-n
employee-n
executive-n
friendly-adj
group-n
independent-adj
interest-n
investor-n
look-v
maker-n
market-n
merger-n
open-adj
own-v
partner-n
say-v
stock-n
take-v
talk-n

Associated Codes

OR
BZ
MAR, MERG, MI
CHR
F, SCG, TH
LAB
BZ, GOV
FA, G
GROU, OR, POP
CHT, FA
BZ, EC, G, J, U
IV, ON
PHYA
JC
BZ, MAR
MERG
CER, PFE
MEN
PAPP
DA, F, MGE, TG
CN
AH, AM, AP, BRE, FLW
FOO, GU, IV, PM
EC, PG, SH, V, WRI
RHE



The meanings of these codes are given below:

CODE    Explanation

AH      Animal Farming & Husbandry
AM      Animal Names (not taxonomic terms (TAX))
AP      Anthropology & Ethnology (incl racial groups)
BRE     Breeds and Breeding
BZ      Business & Commerce
CER     Ceremonies
CHR     Christianity
CHT     Character Traits (eg. meddlesome, mellow, outgoing)
CN      Communications (eg. telephony, telegraphy,
        audiovisual, information science, radio)
DA      Dance & Choreography
EC      Economics & Finance
F       Finance & Business
FA      Overseas Politics & International Relations
FLW     Flower Names: plants known primarily as flowers
FOO     Foods: all edible items
G       Sports (incl Games & Pastimes)
GOV     Government Admin & Organisations (eg reahufScs)
GROU    Groups of Musicians
GU      Guns
IV      Investment & Stock Markets
J       Crime and the Law
JC      Judaeo-Christian Religion
LAB     Staff and the Workforce (incl Labour relations)
MAR     Marketing & Merchandising
MEN     Mental States & Feelings
        (eg. depressed, tense, nonplussed)
MERG    Mergers, Monopolies, Takeovers, Joint Ventures
MGE     Marriage, Divorce, Relationships & Infidelity
MI      Military (the armed forces)
ON      Occupations & Trades
OR      Organisations, Groups & Orders
PAPP    Paper & Stationery
PFE     Banking & Personal Finance
PG      Photography
PHYA    Animal physiology
PM      Plant Names
POP     Pop & Rock
RHE     Rhetoric & Oratory (eg. ad lib, eulogy, scripted)
SCG     Scouting & Girl Guides
SH      Clothing
TG      Team Games
TH      Theatre
U       Politics, Diplomacy & Government
V       Travel and Transport (incl. transport infrastructure)
WRI

A further embodiment relating to subject analysis
involves a method which is the same as that described
above, except that each word is first lemmatised (rather
than stemmed), and then replaced by all of the subject
domain codes associated with that word. The individual
scores for pairs of text units are then calculated on the
basis of shared codes rather than shared words, using
code-index records, rather than stem-index records.

However, an extra (disambiguation) step is required in
order to avoid (or at least reduce the chances of)
counting codes which are out of context, that is codes
which relate to senses of the word other than the intended
sense. The disambiguation step involves dropping text
unit indexes from the code-index records of tuplets if
they relate to the same word as the first element (i.e.
text unit index) of the tuplet. This requires that the
word associated with each text unit index in each
code-index record be remembered (i.e. recorded) by the
procedure. This procedure can be demonstrated by the
following example.

In the sample text the code BZ (Business & Commerce) is
associated with the words:

executive occurring once in text units #2# and #3#
business occurring once in text unit #3#
market occurring once in text unit #4#
interest occurring once in text unit #5#

Consequently, a code-index record can be made where the
subject domain code BZ is associated with these text unit
indexes, that is:

<BZ {#2# #3# #3# #4# #5#}>

The full list of code-index records for the sample text
is shown below (instances where a code occurs in a single
text unit are removed as they do not represent lexical
cohesion links).

<BZ {#2# #3# #3# #4# #5#}>
<CN {#2# #3# #3# #4#}>
<DA {#1# #2#}>
<F {#1# #2# #2# #3# #6#}>
<FA {#2# #5#}>
<GOV {#2# #3#}>
<IV {#4# #5#}>
<MGE {#1# #2#}>
<MI {#2# #3# #6#}>
<SCG {#2# #3# #6#}>
<TG {#1# #2#}>
<TH {#2# #3# #6#}>

The first tuplet (disregarding the disambiguation step
mentioned above) is then:

<#1# <DA {#1# #2#}>
     <F {#1# #2# #2# #3# #6#}>
     <MGE {#1# #2#}>
     <TG {#1# #2#}>>

and so on for the other tuplets.

To simplify matters, in order to illustrate the
disambiguation step, rather than calculate the individual
scores for each pair of text units, we shall consider only
the contribution to the individual scores which is made
by one of the codes, for example code BZ. The BZ
components of all the tuplets are:

<#2# <BZ {#2# #3# #3# #4# #5#}>>
<#3# <BZ {#2# #3# #3# #4# #5#}>>
<#4# <BZ {#2# #3# #3# #4# #5#}>>
<#5# <BZ {#2# #3# #3# #4# #5#}>>

where indexes identical with the first index of each
tuplet are shown in strikethrough to indicate that they
are excluded, as above.

When allowance is made for the fact that each index is
associated with a particular word, the BZ components of
the tuplets become:

<#2(executive)# <BZ {#3(business)# #4# #5#}>>
<#3(executive)# <BZ {#4# #5#}>>
<#3(business)# <BZ {#2(executive)# #4# #5#}>>
<#4# <BZ {#2# #3# #3# #5#}>>
<#5# <BZ {#2# #3# #3# #4#}>>

Here the disambiguation step is illustrated by showing
indexes relating to words identical with the first
index of each tuplet in strikethrough to indicate that
they are excluded. The final tuplets are then:

<#2(executive)# <BZ {#3(business)# #4# #5#}>>
<#3(executive)# <BZ {#4# #5#}>>   nb. #2(executive)# excluded.
<#3(business)# <BZ {#2# #4# #5#}>>
<#4# <BZ {#2# #3# #3# #5#}>>
<#5# <BZ {#2# #3# #3# #4#}>>



The contributions made by BZ to the individual scores of
text unit pairs are then as follows:

        #1#   #2#   #3#   #4#   #5#   #6#
#1#            0     0     0     0     0
#2#      0           1     1     1     0
#3#      0     1           2     2     0
#4#      0     1     2           1     0
#5#      0     1     2     1           0
#6#      0     0     0     0     0
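
The disambiguation rule can be sketched by recording each occurrence of a code as a (text unit, word) pair and counting a link only when both the text unit and the word differ. This is an illustrative reading of the example, not the procedure of the figures: the function name code_contributions is hypothetical, and the exemption for words that carry only a single subject code is not modelled here.

```python
from collections import defaultdict

# Occurrences of code BZ in the sample text as (text unit, word) pairs.
bz = [(2, "executive"), (3, "executive"), (3, "business"),
      (4, "market"), (5, "interest")]
# Occurrences of code GOV: both arise from the same word.
gov = [(2, "executive"), (3, "executive")]


def code_contributions(occurrences):
    """Count links between text units that share a code via different words."""
    scores = defaultdict(lambda: defaultdict(int))
    for head_unit, head_word in occurrences:
        for unit, word in occurrences:
            # Drop indexes for the same text unit or the same word.
            if unit != head_unit and word != head_word:
                scores[head_unit][unit] += 1
    return {unit: dict(row) for unit, row in scores.items()}


print(code_contributions(bz))
print(code_contributions(gov))
```

With these inputs the BZ contributions add up to 16 links, while GOV, which arises from a single word, yields none.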





When the same procedure is followed for certain other
codes, such as DA, FA, GOV etc, no valid tuplets result.
This is because the text unit indexes within the
code-index records for these codes all relate to the same
word. For example, the code GOV arises from the word
"executive" which occurs in text units #2# and #3#, thus
creating the code-index record <GOV {#2# #3#}> mentioned
above. Because this code-index record does not form a
valid tuplet, the "Government" sense of the word
"executive" makes no contribution to the individual
scores mentioned above. We have already seen that the
"Business" sense of the word "executive" does make such
a contribution, which is the desired result because it
is the "Business" sense of the word which is intended in
the sample text. The method thus achieves a degree of
disambiguation of the subject domain codes, and rejects
codes which are out of context.

Only instances where the words related to the same code
differ in spelling are taken into account. This makes it
possible to achieve higher precision in individuating
salient themes/topics and assessing their relative
importance. Taking the intersection of code sets for
words with different spelling occurring in the same
document tends to exclude contextually inappropriate
interpretations for the words.

However, in cases where a word in the sample text is
associated with only one subject code, the disambiguation
step is not carried out because no disambiguation is
necessary. Hence the code CN, relating to the word "say",
remains.

The following table shows the text unit pairs which each
code connects.

CODES   TEXT UNIT PAIRS

BZ      2-3 2-4 2-5 3-4 3-5 3-2 3-4 3-5 4-2 4-3 4-3 4-5 5-2 5-3 5-3 5-4
F       1-2 1-3 1-6 2-1 2-3 2-6 3-1 3-2 6-1 6-2
FA      2-5 5-2
IV      4-5 5-4
CN      3-4 4-3



Only five codes form valid tuplets, all the other codes
being excluded (as described above).

In total, we have 16 text unit pairs for BZ, 10 for F
and 2 each for FA, IV and CN. These data can be used to rank
text units in the sample text in terms of topic aptness
by adaptation of the procedure of Figure 1.

The total of all individual scores for each subject domain
code (e.g. 16 for BZ, etc) can be converted into percentage
ratios to provide a topic/theme profile of the text as
shown in the table below:



50%      BZ   Business & Commerce
31.25%   F    Finance & Business
6.25%    IV   Investment & Stock Markets
6.25%    FA   Overseas Politics & International Relations
6.25%    CN   Communications

For example, the percentage for BZ is calculated as
16/(16+10+2+2+2) = 50%
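
The same arithmetic can be applied to every code at once. The sketch below takes the text unit pair counts given above as its input; the function name topic_profile is hypothetical.

```python
# Number of text unit pairs connected by each subject domain code.
pair_counts = {"BZ": 16, "F": 10, "FA": 2, "IV": 2, "CN": 2}


def topic_profile(counts):
    """Convert per-code totals into percentages of the overall total."""
    total = sum(counts.values())
    return {code: 100 * n / total for code, n in counts.items()}


for code, share in topic_profile(pair_counts).items():
    print(f"{share:6.2f}%  {code}")
```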

When used in a summarization system, the level-based
differentiation of text units obtained through the
ranking procedure of Figure 1 (whether based on words or
on codes) can be made to provide an automatic indication
of abridgement size, for example by automatic selection
of all level 1 text units.

Summary size can also be specified by the user, e.g. as
a percentage of the original text size, the selected text
units being chosen from among the ranked text units with
higher levels and higher scores.
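
A user-specified summary size could be applied to the ranked list along the following lines. This is only a sketch under stated assumptions: size is measured in number of text units (the text does not say whether words or text units are counted), the count is rounded up, at least one unit is always kept, and the selected units are returned in their original order. The name select_units is hypothetical.

```python
import math


def select_units(ranked, fraction):
    """Pick the top-ranked share of text units, restored to text order."""
    # Assumption: size is counted in text units and rounded up.
    n = max(1, math.ceil(fraction * len(ranked)))
    return sorted(ranked[:n])


ranking = [2, 3, 1, 5, 6, 4]   # ranking of the sample text units
print(select_units(ranking, 0.5))
```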



The methods described can also be used as indexing devices
in various information systems such as information
retrieval and information extraction systems. For
example, in a database comprising a large number of texts
it is often desirable to provide a short abstract of each
text to assist in both manual and computer searching of
the database. The methods described above can be used to
generate such short abstracts automatically.

The ranking method described above can also be applied
taking into account additional ways of assessing lexical
cohesion, which could be used at step 18 of Figure 1,
such as:

the presence of synonyms across text units as established
by consulting an electronic dictionary of synonyms;

the presence of words sharing the same semantic indicators
across text units as established by consulting an
electronic dictionary, as in the example with subject
domain codes discussed above;

the presence of near-synonymous words across text units
established by estimating the degree of semantic
similarity between word pairs, as in the method
disclosed in British Patent Application No. 9717508.7;

the presence of anaphoric links across text units, i.e.
links between a referential expression such as a
pronoun or a definite description (e.g. The company
in text unit #6#), and its antecedent (Apple in text
unit #5#).

The same ranking method described in the preferred
embodiment can also be applied by using formatting
commands as indicators of the relevance of particular
types of text fragments. For example, text fragments
enclosed in formatting commands encoding titles and
section headings such as

<h2>Reports Apple Looking for a Partner</h2>

typically contain words which can be effectively used to
provide an indication of the main topic in a text. These
words can be given extra weight in the above method, and
thus be used to assign additional textual relevance to
text units which contain them, e.g. by increasing
further the lexical cohesion score of such text units
during the ranking procedure described above. Formatting
commands can also be selectively preserved so as to
maintain as much of the page layout for the original text
as possible.



The ranking method described above can also be applied
by using lemmatizing instead of stemming as a word
tokenization technique, or dispensing with word
tokenization altogether.

The same ranking method can also be applied to text
written in a language other than English, by providing

a list of stop words for the language;

a stemmer or lemmatizer for the language; and

any additional means for assessing lexical cohesion
in the language such as semantic similarity and
anaphoric links.

Figure 3 shows schematically a system suitable for
carrying out the methods described above. The system
comprises a programmable data processor 70 with a program
memory 71, for instance in the form of a read only memory
(ROM), storing a program for controlling the data processor
70 to perform, for example, the method illustrated in
Figures 1 and 2. The system further comprises
non-volatile read/write memory 72 for storing, for example,
the list of stop words and the subject domain codes
mentioned above. Working or scratch pad memory for
the data processor is provided by random access memory
(RAM) 73. An input interface 74 is provided, for instance
for receiving commands and data. An output interface 75
is provided, for instance, for displaying information
relating to the progress and result of the procedure.

A text sample may be supplied via the input interface 74
or may optionally be provided in a machine-readable store
76. A thesaurus and/or a dictionary may be supplied in
the read only memory 71 or may be supplied via the input
interface 74. Alternatively, an electronic or
machine-readable thesaurus 77 and an electronic or
machine-readable dictionary 78 may be provided.

The program for operating the system and for performing
the method described hereinabove is stored in the program
memory 71. The program memory may be embodied as
semiconductor memory, for instance of ROM type as
described above. However, the program may be stored in
any other suitable storage medium, such as floppy disc
71a or CD-ROM 71b.

INDUSTRIAL APPLICABILITY

The use of the structures according to the present
invention considerably reduces the time taken to operate
on the text because it is no longer necessary to count
the number of strings shared between all possible pairs
of text units in turn.

More specifically, the degree of connectivity of a text
unit with all other text units in a text can be simply
assessed by quantifying the elements (e.g. words) which
each text unit shares with pairs built by associating each
element in the text with the list of pointers to the text
units in which the element occurs. This provides a
significant advantage in terms of processing speed when
compared to a method such as the one described by Hoey
(1991) and Collier (1994) where the same assessment is
carried out by computing all pairwise combinations of text
units. In particular, the word-per-second processing
rate is significantly less affected by text size.

CLAIMS

1. A method of operating on a text comprising a plurality
of text units, each comprising one or more strings, the
method being characterised by:

forming a structure for each of at least some of said
strings, in which structure a string is associated with
each pair of text units in which the string occurs;

for each pair of text units summing the number of
occurrences of each other text unit in the same structure
or structures so as to form an individual score for each
pair of text units; and

processing said individual scores for each pair of text
units in order to form a final score for each pair of text
units to determine how many times any string is shared
between each pair of text units and other text units.

2. A method of operating on a text as claimed in claim 1,
which includes the further step of ranking the text units
on the basis of said individual scores.

3. A method of operating on a text as claimed in
claim 1, wherein said text units are sentences, said
strings are words forming said sentences, and the method
comprises the additional steps of removing stop-words,
stemming each remaining word and indexing the sentences
prior to carrying out said summing step, and wherein said
structures are stem-index records each comprising a
stemmed word and one or more indexes corresponding to
sentences in which said stemmed word occurs.

4. A method of operating on a text as claimed in claim 1,
wherein said text is associated with a word text comprising
words, each word being associated with one or more subject
codes representing subjects with which said word is
associated, and wherein said strings are subject codes
associated with said words.

5. A method of operating on a text as claimed in claim 4,
which comprises the further step of keeping a record of
the word spelling associated with each occurrence of a
subject code in a text unit, and wherein during said
summing step occurrences of the same subject code in a
pair of text units are disregarded if the same word
spelling is associated with said same subject code in said
pair of text units.



6. A method of operating on a text as claimed in
claim 5, wherein said step of disregarding occurrences
of subject codes is not carried out for subject
codes which relate to only a single word spelling in the
word text.

7. A method of operating on a text as claimed in
claim 1, wherein said processing step includes
calculating a level for each text unit, in addition
to said final score, and wherein said level indicates
the value of the highest of said individual scores in
relation to a threshold value.

8. A storage medium containing a program for controlling
a programmable data processor (70) to perform a method
as claimed in claim 1.

9. A system for ranking text units in a text, the system
comprising a data processor (70) programmed to perform
the steps of the method of claim 1.